MY_UNIQNAME = 'Tabbie'
YouTube provides a list of trending videos on its site, determined by user interaction metrics such as likes, comments, and views. This dataset includes several months of daily trending videos across five regions: the United States ("US"), Canada ("CA"), Great Britain ("GB"), Germany ("DE"), and France ("FR").
This data set includes 721 Pokemon, including their number, name, first and second type, and basic stats: HP, Attack, Defense, Special Attack, Special Defense, and Speed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
youtube_US = pd.read_csv('youtube-new/USvideos.csv')
youtube_US
US_comments = youtube_US["comment_count"]
sns.distplot(US_comments, kde=True, color = "darkmagenta").set(xlim=(0, 200000))
The histogram of comments on US videos is positively skewed: almost all the values fall in the leftmost part of the range. The highest frequency is in the range 0-25000.
US_views = youtube_US["views"]
sns.distplot(US_views, kde=True, color = "green").set(xlim=(0,20000000))
The histogram of views on US videos is positively skewed: almost all the values fall in the leftmost part of the range. The highest frequency is in the range 0-50000000.
US_likes = youtube_US["likes"]
sns.distplot(US_likes, kde=True, color = "darkblue").set(xlim=(0, 500000))
The histogram of likes on US videos is positively skewed: almost all the values fall in the leftmost part of the range. The highest frequency is in the range 0-100000.
US_dislikes = youtube_US["dislikes"]
sns.distplot(US_dislikes, kde=True, color = "red").set(xlim=(0, 200000))
The histogram of dislikes on US videos is positively skewed: almost all the values fall in the leftmost part of the range. The highest frequency is in the range 0-25000.
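The visual impression of positive skew in the four histograms above can also be checked numerically. The sketch below is only an illustration on synthetic data: the lognormal sample and the variable names are assumptions standing in for a heavy-tailed count column such as comment_count, not values from the YouTube files.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a heavy-tailed count column (hypothetical data).
rng = np.random.default_rng(0)
counts = rng.lognormal(mean=7.0, sigma=1.5, size=5000)

# Sample skewness: positive when the long tail is on the right.
raw_skew = stats.skew(counts)

# In a right-skewed sample the mean is pulled above the median.
mean_value = counts.mean()
median_value = np.median(counts)

print(f"skewness = {raw_skew:.2f}")
print(f"mean = {mean_value:.0f}, median = {median_value:.0f}")
```

On the real data, the same call could be applied to a column such as youtube_US["comment_count"]; a clearly positive result would agree with what the histograms show.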
US_comments_log = np.log(US_comments + 0.0001)
sns.distplot(US_comments_log,kde=True, color = "purple")
sns.distplot(US_comments_log,kde=True, color = "purple").set(xlim=(-10, 10))
The log histogram of comments has a different shape from the raw histogram. It is left skewed: the log transformation of the comment_count column of the youtube_US dataframe reduces the concentration of values near zero and pushes the distribution to the right. The highest frequency is now at 7.5.
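The small offset added before taking the log matters for rows whose count is exactly zero: they all map to log(0.0001), roughly -9.21, which presumably explains why the untruncated plot extends so far to the left. The sketch below uses a tiny hypothetical array (not the real data) to compare that transform with np.log1p, which computes log(1 + x) and maps zero to zero.

```python
import numpy as np

# Hypothetical comment counts, including videos with zero comments.
counts = np.array([0, 0, 3, 250, 12000])

# The transform used in this notebook: zeros become log(0.0001).
shifted = np.log(counts + 0.0001)

# An alternative: log(1 + x), which keeps zeros at zero.
log1p = np.log1p(counts)

print(shifted.round(2))
print(log1p.round(2))
```

Either choice is defensible; the offset only changes where the zero-count rows land, not the shape of the rest of the distribution.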
US_views_log = np.log(US_views + 0.0001)
sns.distplot(US_views_log,kde=True, color = "darkgreen")
sns.distplot(US_views_log,kde=True, color = "darkgreen").set(xlim=(12, 16))
The log histogram of views has a different shape from the raw histogram. It is left skewed: the log transformation of the views column of the youtube_US dataframe removes the concentration of values near zero and pushes the distribution to the right. The highest frequency is now at 14.0.
US_likes_log = np.log(US_likes + 0.0001)
sns.distplot(US_likes_log,kde=True, color = "darkblue")
sns.distplot(US_likes_log,kde=True, color = "darkblue").set(xlim=(0, 15))
The log histogram of likes has a different shape from the raw histogram. It is left skewed: the log transformation of the likes column of the youtube_US dataframe reduces the concentration of values near zero and pushes the distribution to the right. The highest frequency is now at 10.
US_dislikes_log = np.log(US_dislikes + 0.0001)
sns.distplot(US_dislikes_log,kde=True, color = "darkred")
sns.distplot(US_dislikes_log,kde=True, color = "darkred").set(xlim=(0,8))
The log histogram of dislikes has a different shape from the raw histogram. It is left skewed: the log transformation of the dislikes column of the youtube_US dataframe reduces the concentration of values near zero and pushes the distribution to the right. The highest frequency is now at 6.5.
sns.pairplot(youtube_US,vars=['comment_count','views','likes','dislikes'])
In the US videos pairplot above, the scatterplots look like water spray or fireworks: the data is clustered at zero and spreads out along the x and y axes. The histograms are right skewed. Both the scatterplots and the histograms indicate that the highest frequency of data is near zero.
youtube_CA = pd.read_csv('youtube-new/CAvideos.csv') #Canada
sns.pairplot(youtube_CA,vars=['comment_count','views','likes','dislikes'])
youtube_GB = pd.read_csv('youtube-new/GBvideos.csv') #GreatBritain
sns.pairplot(youtube_GB,vars=['comment_count','views','likes','dislikes'])
youtube_DE = pd.read_csv('youtube-new/DEvideos.csv') #Germany
sns.pairplot(youtube_DE,vars=['comment_count','views','likes','dislikes'])
youtube_FR = pd.read_csv('youtube-new/FRvideos.csv') #France
sns.pairplot(youtube_FR,vars=['comment_count','views','likes','dislikes'])
The four pairplots above are similar to the US videos pairplot. The scatterplots in all of them look like water spray or fireworks: the data is clustered at zero and spreads out along the x and y axes. The histograms are right skewed. Both the scatterplots and the histograms indicate that the highest frequency of data is near zero.
A heat map (or heatmap) is a graphical representation of data where the individual values contained in a matrix are represented as colors.
Seaborn makes it easy to create a heatmap with seaborn.heatmap().
The correlation matrix comes from DataFrame.corr(). That is, if your dataframe is called df, use df.corr(). Pass the result to seaborn.heatmap(), and annotate it with the parameter annot=True.
# numeric_only=True skips text columns; recent pandas versions raise an error on them otherwise
GB_corr = youtube_GB.corr(numeric_only=True)
GB_corr
sns.heatmap(GB_corr, annot = True)
The seaborn heatmap is a large square made up of smaller squares in various shades, with the variables listed along the bottom and the left side. Each small square represents the correlation coefficient between the variable below it and the variable to its left. The color bar on the right shows how strong each correlation is: the darker the shade, the more negative the correlation. Each square also contains the exact correlation coefficient of its two variables. The most negatively correlated variables in the Great Britain YouTube dataframe are category_id and views, with a coefficient of -0.17. The most positively correlated variables are views and likes, at 0.8, followed by comment_count and dislikes at 0.77. comment_count and likes have a correlation coefficient of 0.74, and dislikes and views have 0.39. The positive correlation between dislikes and views suggests that videos continue to be watched even when they attract dislikes.
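Reading the strongest and weakest pairs off a heatmap by eye is error-prone, so it can help to rank the off-diagonal entries of the correlation matrix directly. The sketch below is one possible approach on synthetic data; the column names and values are made up for illustration and are not taken from GBvideos.csv.

```python
import numpy as np
import pandas as pd

# Synthetic data: "likes" is built to track "views", "noise" is unrelated.
rng = np.random.default_rng(1)
views = rng.normal(size=500)
df = pd.DataFrame({
    "views": views,
    "likes": views + rng.normal(scale=0.5, size=500),
    "noise": rng.normal(size=500),
})

corr = df.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so that each
# pair of variables appears exactly once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values()

strongest = pairs.idxmax()  # tuple with the two column names
print(pairs)
print("strongest pair:", strongest)
```

Applied to a matrix such as GB_corr, pairs.idxmin() and pairs.idxmax() would name the most negative and most positive pairs without relying on the color scale.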
vid_views = smf.ols('views ~ C(video_error_or_removed)', data = youtube_US).fit() #y~x, y represents outcome/dependent variable and x represents independent variable
vid_views.summary()
vid_views_anova = sm.stats.anova_lm(vid_views, typ=2)
print(vid_views_anova)
In the regression table with views as the outcome (dependent) variable and video_error_or_removed as the independent variable, the R-squared value is 0, which means the predictor explains essentially none of the variance in views. The F-statistic itself is not compared with 0.05; its p-value is. That p-value is 0, which is less than 0.05, so the difference in mean views between the two groups is statistically significant. A significant p-value does not make the model a good fit, however: with an R-squared of 0, the difference between the groups is detectable but accounts for almost none of the variation in views.
commd_views = smf.ols('views ~ C(comments_disabled)', data = youtube_US).fit()
commd_views.summary()
commd_views_anova = sm.stats.anova_lm(commd_views, typ=2)
print(commd_views_anova)
In the regression table with views as the outcome (dependent) variable and comments_disabled as the independent variable, the R-squared value is 0, which means the predictor explains essentially none of the variance in views. The F-statistic itself is not compared with 0.05; its p-value is. That p-value is 0, which is less than 0.05, so the difference in mean views between the two groups is statistically significant. A significant p-value does not make the model a good fit, however: with an R-squared of 0, the difference between the groups is detectable but accounts for almost none of the variation in views.
ratd_views = smf.ols('views ~ C(ratings_disabled)', data = youtube_US).fit()
ratd_views.summary()
ratd_views_anova = sm.stats.anova_lm(ratd_views, typ=2)
print(ratd_views_anova)
In the regression table with views as the outcome (dependent) variable and ratings_disabled as the independent variable, the R-squared value is 0, which means the predictor explains essentially none of the variance in views. The F-statistic itself is not compared with 0.05; its p-value is. That p-value is 0, which is less than 0.05, so the difference in mean views between the two groups is statistically significant. A significant p-value does not make the model a good fit, however: with an R-squared of 0, the difference between the groups is detectable but accounts for almost none of the variation in views.
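The combination seen in the three tables above, a p-value near 0 together with an R-squared near 0, is common when the sample is large and the group difference is small. The sketch below reproduces that pattern on synthetic data. The group sizes and the 0.1 shift in the mean are arbitrary assumptions, and scipy.stats.f_oneway is used here in place of the statsmodels calls only to keep the example short.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 20000  # observations per group (hypothetical, chosen to mimic a large sample)

group_a = rng.normal(loc=0.0, scale=1.0, size=n)
group_b = rng.normal(loc=0.1, scale=1.0, size=n)  # tiny shift in the mean

# One-way ANOVA: tests whether the two group means differ.
f_stat, p_value = stats.f_oneway(group_a, group_b)

# R-squared for a one-factor model = between-group SS / total SS.
both = np.concatenate([group_a, group_b])
grand_mean = both.mean()
ss_between = (n * (group_a.mean() - grand_mean) ** 2
              + n * (group_b.mean() - grand_mean) ** 2)
ss_total = ((both - grand_mean) ** 2).sum()
r_squared = ss_between / ss_total

print(f"F = {f_stat:.1f}, p = {p_value:.2e}, R^2 = {r_squared:.4f}")
```

The p-value answers whether the group means differ at all, while R-squared answers how much of the outcome's variation the grouping explains; the two can disagree sharply, as they do here.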
Pok = pd.read_csv('pokemon/Pokemon.csv')
Pok
pok_corr = Pok.corr(numeric_only=True)  # skip text columns such as Name and Type 1
pok_corr
sns.heatmap(pok_corr, annot = True)
From the heat map above, Defense and Speed have the lowest correlation coefficient, 0.015, shown by the darkest box. The highest correlation coefficient, 0.51, is shared by two pairs: Defense and Sp. Def, and Sp. Atk and Sp. Def.
type1_pok = Pok['Type 1'].unique()
type1_pok
for x in type1_pok:
    # One pairplot per primary type
    dist_pok = Pok[Pok['Type 1'] == x]
    fig_pok = sns.pairplot(dist_pok, vars=['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed'])
    fig_pok.fig.suptitle('Figure: Pokemon of type -' + x, verticalalignment='top', fontsize=25, fontweight='bold')
The pairplots for each Pokemon type give some eye-catching insights. The scatterplots for the Flying type show far fewer Pokemon than those for the other types. The scatterplots for the Normal type are densely populated, showing that Normal Pokemon are numerous compared with the other types. Flying Pokemon with a Speed of 120 have the highest frequency, and the Flying type also shows high frequencies for the other stats compared with the other types. Dragon, Ice, Ground, and Bug also show fairly high frequencies of high HP among all the types. Some Steel-type Pokemon display an exceptional Defense of 150. I believe Flying, Fairy, Dragon, and Electric Pokemon are among the strong fighting Pokemon, as their histograms show a good deal of right skewness.
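Claims about how many Pokemon each type has, or about the highest stat within a type, can be checked with a count and a grouped maximum instead of reading them off the scatterplots. The sketch below uses a tiny hypothetical table with the same column names as Pokemon.csv; the rows are invented for illustration.

```python
import pandas as pd

# Tiny hypothetical table using the same column names as Pokemon.csv.
pok = pd.DataFrame({
    "Type 1": ["Normal", "Normal", "Normal", "Flying", "Steel", "Steel"],
    "Speed": [60, 90, 45, 120, 30, 50],
    "Defense": [50, 60, 40, 70, 150, 115],
})

# Number of Pokemon per primary type, largest first.
type_counts = pok["Type 1"].value_counts()

# Highest Speed and Defense observed within each type.
type_max = pok.groupby("Type 1")[["Speed", "Defense"]].max()

print(type_counts)
print(type_max)
```

Running the same two lines on the Pok dataframe would give the per-type counts and maxima behind the observations above.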
each_gens = Pok['Generation'].unique()
each_gens
for x in each_gens:
    # One pairplot per generation
    all_gens = Pok[Pok['Generation'] == x]
    gen_group = sns.pairplot(all_gens, vars=['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed'])
    gen_group.fig.suptitle('Figure: Generation - ' + str(x), verticalalignment='top', fontsize=20, fontweight='bold')
The histograms for all the generations show a similar spread and shape. Speed shows its highest frequencies in the range 50-80 in every generation except Generation 2, where the peak is at 40-50, and Generation 4, where it is at 80-100. Sp. Def shows its highest frequencies around 80 in almost all the generations. Attack has its highest frequency at approximately 60 in all six generations. Therefore I conclude that the designers did not try to use different stat distributions in different generations.
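The comparison across generations can be backed by a per-generation summary table, which is easier to compare than six separate pairplots. The sketch below is a minimal illustration on made-up rows that reuse the Generation, Speed, and Attack column names; the numbers are not from Pokemon.csv.

```python
import pandas as pd

# Hypothetical rows reusing the column names from Pokemon.csv.
gens = pd.DataFrame({
    "Generation": [1, 1, 1, 2, 2, 2],
    "Speed": [50, 60, 70, 40, 45, 50],
    "Attack": [55, 60, 65, 58, 60, 62],
})

# Median of each stat within each generation.
gen_summary = gens.groupby("Generation")[["Speed", "Attack"]].median()

# Range of the medians across generations: small values mean the
# generations look alike on that stat.
spread = gen_summary.max() - gen_summary.min()

print(gen_summary)
print(spread)
```

On the real data, small spreads across all six stats would support the conclusion that the generations were designed with similar stat distributions.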